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Abstract 

We study numerical methods for sampling probability measures in high dimension 
where the underlying model is only approximately identified with a gradient system. Ex¬ 
tended stochastic dynamical methods are discussed which have application to multiscale 
models, nonequilibrium molecular dynamics, and Bayesian sampling techniques arising in 
emerging machine learning applications. In addition to providing a more comprehensive 
discussion of the foundations of these methods, we propose a new numerical method for the 
adaptive Langevin/stochastic gradient Nose-Hoover thermostat that achieves a dramatic 
improvement in numerical efficiency over the most popular stochastic gradient methods 
reported in the literature. We also demonstrate that the newly established method in¬ 
herits a superconvergence property (fourth order convergence to the invariant measure 
for configurational quantities) recently demonstrated in the setting of Langevin dynamics. 
Our findings are verified by numerical experiments. 


1 Introduction 

Stochastic thermostats [37l[55l[56] are powerful tools for sampling probability measures on 
high-dimensional spaces. These methods combine an extended dynamics with degenerate 
stochastic perturbation to ensure ergodicity. The traditional use of thermostats in molecular 
dynamics is to sample a well-specified equilibrium system involving a known force field which is 
the gradient of a potential energy function. Recently, however, these techniques have become 
increasingly popular for problems of more general form, including the following: 

• multiscale models in which the forces are obtained by approximate sampling in another 
scale regime [17 1 1 ^ ^0 1 144 ( 148 ( ^9] : 

• nonequilibrium physical models in which the potential energy function either is evolving 
or does not completely specify the system [27l[4n[45ll47ll58l[^ : 

• Bayesian machine learning applications in which a dataset defines an objective function 
which leads to an effective force law p [Tn[mi46ll57ll62ll63j . 

In this article, we consider thermostats and numerical methods for sampling an underlying 
probability measure in the presence of error, under the assumption that the errors are random 
with a simple distributional form and unknown, but constant or slowly varying, parameters. 
In the cases considered, these methods are simple to implement, robust, and efficient. 
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1.1 Thermostats 


The main tool that we employ in this article is the general concept of a thermostat as 
a (stochastic) distributional control for a dynamical system. These methods originate in 
molecular dynamics, and it is simplest to explain them in that context. Classical molec¬ 
ular dynamics tracks the motion of individual atoms determined by Newton’s law in the 
microcanonical {NVE) ensemble, where energy (i.e., the Hamiltonian of the system) is al¬ 
ways conserved [HIISlEniES]- However, constant energy is not the appropriate setting of a 
real-world laboratory environment. In most cases, one wishes instead to sample the canon¬ 
ical (NVT) ensemble, where temperature, as an intensive variable, is conserved, by using 
thermostat techniques [T81I25] . 

The idea of a thermostat is to modify dynamics so that a prescribed invariant measure 
is sampled. There are competing aims in this type of work. For example, one may wish 
to perturb the underlying Newtonian dynamics minimally, so that temporal correlations are 
preserved, or one may be interested in sampling rare events in a system with metastable states; 
thus a variety of methods have been developed. The most obvious proposals, and also the 
oldest, are Brownian and Langevin dynamics. In Brownian (sometimes called “overdamped 
Langevin”) dynamics, the system is 

dq=-XVU{q)dt + ^/2P^dW, (1) 

where q represents a 3A^-dimensional vector of time-dependent random variables, dW rep¬ 
resents a vector of infinitesimal Wiener increments, /3 is a positive parameter (proportional 
to the reciprocal temperature), U is the potential energy function, and A is a free parame¬ 
ter which represents a time-rescaling. It can be shown [9] that this system m ergodically 
samples the Gibbs-Boltzmann probability distribution pj^ oc exp(—/3[/). For simplicity, we 
assume that the configurations q are restricted to a compact and simply connected domain 
Hg. In molecular dynamics applications, the starting point is the potential energy function, 
which is usually assumed to be a semiempirical formula constructed from primitive functions 
via an a priori parameter fitting procedure. Alternatively, one may assume that it is the 
probability distribution that is specihed and that the potential energy is constructed from it 
via 

U = —Inp, 

which, of course, requires that p > 0. In many applications it is found that the use of a first 
order dynamics such as ([T]) is inefficient or introduces unphysical dynamical properties, and 
one employs, instead, the Langevin dynamics method: 

dq = M-^pdt , (2) 

dp = -VU{q)dt - jpdt + ^/2p^M^/‘^dW . (3) 

Again, 7 in these equations is a free parameter, termed the “friction constant”. It is related 
to the timescale on which the variables of the system interact with particles of a fictitious 
extended “bath”, but it cannot be associated with a simple time-rescaling of the equations of 
motion and is thus different from A in m- It IS a little luore involved to shoiv that (j2p (j3p 
ergodically [32] samples the distribution with density pp oc exp(—/3H(q,p)), where H{q,p) = 
p'^M~^p/2 + U{q). In molecular dynamics, the 2)N x 3A^ matrix M is typically diagonal and 
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contains the masses of atoms, p represents the momentum vector, and H is the Hamiltonian 
or energy function. In more general settings, the masses and friction coefficient may be treated 
as free parameters, and by computing long trajectories of ([2])-(l3|), one may obtain averages 
with respect to p/siq); i.e., if {(q(r),p(r)) : r > 0} is a path generated by solving the SDE 
system ©-([S]), one has, for suitable test functions 4>{q), and under certain conditions on the 
potential energy function U [32], 

lim [ (j){q{T))dT= [ (/>(q)p^(q) dwg , 

where dcoq = dqidq 2 ■ ■ -dq^- In other words, the projected path defines a sampler for the 
density p/ 3 . 

Langevin dynamics can thus be seen as an extended system which allows sampling to be 
performed in a reduced cross section of phase space by marginalization over long trajectories; 
this is the essential property of a thermostat. Other types of thermostats include Nose- 
Hoover-Langevin (NHL) dynamics [371[55| and various generalized schemes (see, e.g., [32]). 
In these methods, one adds additional auxiliary variables which are meant to control the 
dynamics (via a negative feedback loop), and the auxiliary variables are then further coupled 
to stochastic processes of Ornstein-Uhlenbeck type which can provide ergodicity [37j. (Note 
that the use of purely deterministic approaches, such as Nose-Hoover, results in ergodicity 
issues msu) The use of auxiliary variables can provide a degree of flexibility in the design of 
the thermostat, for example, allowing the treatment of systems arising in fluid dynamics [l 6 | 
or imposing an isokinetic constraint [33]. Very recently, we have further generalized the NHL 
method to obtain pairwise Nose-Hoover-Langevin (PNHL), which is a momentum-conserving 
thermostat and thus applicable to the simulation of hydrodynamic behavior in complex fluids 
and polymers in mesoscales [39] . 

1.2 Noisy Gradients 

The gradient (or Hamiltonian) structure is essential to the nature of all the methods described 
above since it is only by use of this feature that the underlying Fokker-Planck equation can be 
shown to have the desired steady state solution. However, in many applications, in particular 
multiscale modelling, the force is corrupted by significant approximation error and cannot be 
viewed as the gradient of a single global potential function. One imagines a large extended 
system involving configurational variables q and y, with {q,y) € x fly (compact), and an 
overall distribution described by a Gibbs-Boltzmann density 

p(Q,y) = ^“^exp (^-/3f7(q,y)) , 

where Z is a normalizing constant so that p is a probability density. One calculates the mean 
force acting on q, /(q), by averaging the forces in the extended Gibbsian system, f{q,y), as 

Kq) = fiQ, y)p{q, y) dwy. 

If, as would typically be assumed, f{q,y) = —'^qU{q,y), i.e., the force in the extended 
system is conservative, then we may interpret / as a conservative force as well, specifically 
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the gradient of the potential of mean force, which is given by 

exp dujy . 

The challenge arises when this integral must be approximated. For example, if this is 
done by Monte Carlo integration, for fixed q, one generates samples ,... from the 

distribution with density p{q, y) and thus approximates the mean force by 

k 

i=l 

In practice most systems constructed in this way, for example, those arising in mixed quantum 
and classical molecular models [7], will admit very substantial errors in the forces; that is, 

fHq) = fiQ) + ^HQ)- 

Depending on the method of computation, it may be reasonable to assume that the errors 
are normally distributed with zero mean, which is justified by the central limit theorem [5], 
but the variance of the errors is generally not known and will be dependent on the location q 
where they are computed; thus we would expect 

A^(q)~AA(0,S'=(q)) , (4) 

where is an unknown covariance matrix. It should be noted that the assumption of the 

errors being Gaussian distributed is also often adopted in Bayesian inverse problems m and 
elsewhere. 

The most straightforward approach to the problem is to first treat the estimation problem 
for 5]^ separately, by some means, and then to use this within a standard Brownian or 
Langevin dynamics algorithm. The difficulty is that this requires a high level of local accuracy 
in the calculations, which is likely to be burdensome and involve redundant computation. 
What we would prefer to do is to resolve the correct target distribution by a global calculation. 

This problem has recently been encountered in the data science community, where it has 
attracted considerable attention To illustrate, we consider the problem 

of Bayesian sampling mm , where one is interested in correctly drawing states from a posterior 
probability density defined as 

7r(6>|X) oc 7 r(X|6>)7r(6>), (5) 

where 0 is the parameter vector of interest, X represents the entire dataset, and, 7r(X|0) and 
7r(0) represent the likelihood and prior distributions, respectively. In these applications, the 
distribution parameters are interpreted as the configuration variables [6 = q). We introduce 
a potential energy U{0) by defining 7r(0|X) oc exp(—,0?7(0)); thus taking the logarithm of (j5|) 
gives 

C/(0) = — log7r(X|0) — log7r(0). (6) 

Assuming the data are independent and identically distributed (i.i.d.), the logarithm of the 
likelihood distribution can then be calculated as 

N 

log7r(X|0) = ^log7r(xi|0), (7) 

i=l 
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where N is the size of the entire dataset. 

However, in machine learning applications, one often finds that directly sampling with the 
entire large-scale dataset is computationally infeasible. For instance, standard Markov chain 
Monte Carlo (MCMC) methods [33] require the calculation of the acceptance probability and 
the creation of informed proposals based on the whole dataset, while the gradient is evaluated 
through the whole dataset in the hybrid Monte Carlo (HMC) method [811151123] . again resulting 
in severe computational complexity. 

In order to improve the efficiency of simulation, the so-called stochastic gradient Langevin 
dynamics (SOLD) was recently proposed [63] based on using a random (and much smaller, 
i.e., h <C N) subset to approximate the likelihood of the dataset for given parameters, 

N 

log7r(X|0) — ^log7r(xrJ0), (8) 

i=\ 

where represents a random subset of X. Overall, the “noisy” potential energy now 

can be written as 

N 

U{0) = —^ ^log7r(xrJ0) - log7r(0), (9) 

i=\ 

with “noisy” force F{Q) = —VU{0). 


1.3 Sampling Methods for Noisy Gradients 

The challenge is to identify a method to compute samples distributed according to the Gibbs 
distribution p{q) = Z~^ exp(—/317(q)), where the only available information is a stochastically 
perturbed force F{q) defined in the previous section. 

In the original SOLD method, samples are generated by Brownian dynamics, 

qn+l = qn + AtnF{qn) + y/2f3-^AtnRn , (10) 


where R„ is a vector of i.i.d. standard normal random variables. It should be emphasized 
that Atn is a sequence of stepsizes decreasing to zero [63] . Although a central limit theorem 
associated with the decreasing stepsize sequence was established by Teh et al. [HI], a hxed 
stepsize is often preferred in practice, which is the choice in this article as in Vollmer et al. [62], 
where a modified SOLD (mSGLD) is introduced: 


qn+i = qn + AtF{qn) -|- ^2/3-1 At (I - —CovF(q„) j R„ , (II) 


where 


CovFij = E 


(f) - E(F,)) {Fj - ]E(F, 


T' 


( 12 ) 


is the covariance matrix of the noisy force. 

A stochastic gradient Hamiltonian Monte Carlo (SGHMC) method was also proposed 
very recently by Ghen et al. which incorporates a parameter-dependent diffusion matrix 
5](q) (i.e., the covariance matrix of the noisy force). S(q) is intended to effectively offset the 
stochastic perturbation of the gradient. However, it is very difficult to accommodate S(q) in 
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practice; moreover, as pointed out in m, poor estimation of it may have a significant adverse 
influence in correctly sampling the target distribution unless the stepsize is small enough. 

These problems challenge the conventional mechanism of thermostats. An article of Jones 
and Leimkuhler [26] provides an alternative means of tackling this problem by showing that 
Nose-Hoover dynamics is able to adaptively dissipate excess heat pumped into the system 
while maintaining the Gibbs (canonical) distribution. In the setting of systems involving a 
driving stochastic perturbation, the adaptive Nose-Hoover method is referred to as Ad-NH, 
with similar generalizations of Nose-Hoover-Langevin (Ad-NHL) and Langevin dynamics 
(Ad-Langevin) available. An idea equivalent to Ad-Langevin was very recently applied in the 
setting of Bayesian sampling for use in data science calculations by Ding et al. [Hj, which they 
referred to as the stochastic gradient Nose-Hoover thermostat (SGNHT). It showed significant 
advantages over alternative techniques such as SGHMC P. However, the numerical method 
used by Ding et al. m is not optimal, neither in terms of its accuracy (measured per unit 
work) nor its stability (measured by the largest usable stepsize). 

Although extended systems have been increasingly popular in molecular simulations, the 
mathematical analysis of the order of convergence, specifically in terms of the bias in averaged 
quantities computed using numerical trajectories, is not fully understood. Using a splitting 
approach, we propose in this article an alternative numerical method for Ad-Langevin simula¬ 
tion that substantially improves on the existing schemes in the literature in terms of accuracy, 
robustness, and overall numerical efficiency. 

The rest of the article is organized as follows. In Section |2[ we describe the construction 
of adaptive formulations for noisy gradients including the Ad-Langevin/SGNHT method. 
Section [3] considers the construction of numerical methods for solving the SDEs. Numerical 
experiments are performed in Section 01 Our experiments are of a more limited nature in com¬ 
parison with those of Ding et al. m, but we believe them to be representative of performance 
on a signihcant class of problems. Finally, we summarize our findings in Section [5l 

2 Adaptive Thermostats for Noisy Gradients 

In this section, we discuss the construction of thermostats to approximate samples with respect 
to the target measure (i.e., the correct marginalized Gibbs density) if the covariance matrix 
of the noisy force is constant, i.e., S(q) = a‘^1 {a is a constant positive quantity). The 
procedure was outlined in the paper of Jones and Leimkuhler |26| and relies on the fact that a 
hxed amplitude noise perturbation engenders a shift of the auxiliary variable in the extended 
stationary distribution associated with the Nose-Hoover thermostat. 

If the system is not coming from a Newtonian dynamics model, then it is unclear that we 
need to rely on second order dynamics for this purpose. To see why this is the case, we explain 
what goes wrong if we try to use hrst order dynamics. In what follows, we assume that the 
covariance matrix of the noisy force is constant, although we ultimately intend to apply the 
method more generally (see recent work on a novel covariance-controlled adaptive Langevin 
thermostat that can handle parameter-dependent noise in [57]). Even in the constant a case 
it is a nontrivial problem to extract statistics related to a particular target temperature, since 
we do not assume that a is known. 
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For a constant, let us first consider the SDE 


dq =-^VC/(q)dt + udW , (13) 

dC = X(q)dt (14) 

and seek x(') so that an extended Gibbs distribution with density of the form = 

P/3{q)^{0 is (ergodically) preserved. The variable ^ is an auxiliary variable. We do not 
generally care what its distribution is, but it is crucial that 

(i) the overall density is in product form, and 

(ii) (p{^) > 0 is normalizable and of a simple, easily sampled form. 

These conditions ensure that we can easily average out over the auxiliary variable to compute 
the averages of functions of q which are of greatest interest. 

Proposition 1. Let x{q) = -13 ^3\U{q) + \\VU{q)f; then ([H])-([M]) preserves the modified 
Gibbs distribution 

where 7 = I2. 

Proof. The Fokker-Planck equation corresponding to (|13I) - (I14I1 is 

pt = := CV • (Vt/(q)p(q,0) + yAp - ^(x(q)p). 

Just insert p into the operator d to see that it vanishes. □ 

Proposition 1 tells us that if we can solve system (IH-dMl), under an assumption of 
ergodicity, we can compute averages with respect to the target Gibbs distribution without 
actually knowing the value of a. a could be observed retrospectively by simply averaging ^ 
during simulation, since {(,) = I3dj2. 

The problem is that the dynamics (I13l) - (ll4p is not quite what we want. A typical numerical 
method for this system might be constructed based on modification of the Euler-Maruyama 
method: 


qn+i = qn- Atf,nVU{qn) + adAPRn , (15) 

?n+l — fn 3- Atx{qn) i (15) 

however, observe that this method requires separate knowledge of \/U{q) and a, which is 
generally impossible a priori, as we assume that the force is polluted by unknown noise. The 
form of the equations means that we evaluate the product of ^ and the deterministic force, 
on the one hand, and the random perturbation, on the other hand, separately, and these 
contributions are independently scaled by At and yfdt, respectively. 
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2.1 The Adaptive Langevin (Ad-Langevin) Thermostat 

To adaptively control the invariant distribution, we consider the following second order for¬ 
mulation, which was first introduced in the paper of Jones and Leimkuhler [26]: 

dq = M~^pdt , 

dp = F{q)dt - Cpdt + UAM^/'^dWA , (17) 

d^ = [p^M~^p — AdfesT] dt. 

In these equations, F{q) is meant to represent a noisy gradient which may be thought of as 
being defined by the relation 

F{q) = -VU{q) + , (18) 

where R = R(t) is a collection of independent Gaussian white noise processes, i.e., (Rj(t)Rj(s)) 
6ij6{t—s), where 5ij is the Kronecker delta and 6{t—s) is the Dirac delta function. aAM^/'^d'W a 
indicates the artificial noise added into the system to enhance the ergodicity; i.e., the constant 
a A is known a priori. All the components of the Wiener process WA(t) are assumed to be 
independent. denotes the number of degrees of freedom of the system, /r is a coupling 
parameter which is referred to as the “thermal mass”. k-Q and T, satisfying the relation 
= /cbT, represent the Boltzmann constant and system temperature, respectively. 

A similar system (SGNHT) was used by Ding et al. [14], who also explored its application 
to three examples from machine learning. These experiments demonstrated that Ad-Langevin 
has superior performance compared to SGHMC in various applications, confirming the im¬ 
portance of adaptively dissipating additional noise in sampling. However, there remain two 
important issues that we wish to address in this article; ( 1 ) the underlying dynamics of the 
Ad-Langevin method is not clear due to the presence of the stochastically perturbed gradient; 
( 2 ) little attention has been paid to the design of optimal numerical methods for implementing 
Ad-Langevin with attention to stability and numerical efficiency. 

One may wonder why the artificial noise is needed (i.e., cta 7 ^ 0), since we are assuming 
the presence of noise in the gradient itself. The reason is as follows: in defining a numerical 
method for the noisy gradient system, the force (including the random perturbation) will in 
general be multiplied by At, where At is the timestep. On the other hand, the Ito rule implies 
that the scaling of random perturbations in an SDE should be by a factor proportional to 
\/At; thus, effectively, if we are to relate the thermostatted method to a standard SDE, the 
standard deviation of the noise is reduced by multiplication by the factor \/At. The noise 
perturbation introduced at each timestep (and the effective diffusion) is thus reduced for small 
stepsizes and it is therefore important to inject additional artificial noise in order to stabilize 
the invariant distribution. A rewriting of the Ad-Langevin system as a standard ltd SDE 
system makes clear the relation between the different terms 

dq = M~^pdt , 

dp = -VU{q)dt + cj\fKtM^/^dW - ipdt + ctaM^/^^Wa , (19) 

d.^ = \jp"M~^p — Ad^eT] dt, 

where W = W(t) is an additional vector of standard Wiener processes. 

Let us note the main features of the dynamics (|19p : 


(i) The equations are a combination of Langevin dynamics and Nose-Hoover dynamics. If 
^ is constant in the equation for the momentum, then the system reduces to Langevin 
dynamics. In the absence of noise, ua = 0 (and a = 0); then the system reduces to 
Nose-Hoover. The system (|19l) may be regarded as a sort of Langevin dynamics where 
the friction coefficient, rather than being fixed a priori, is automatically and adaptively 
determined in order to achieve the desired temperature (which is specified in the control 
law defining the evolution of ^). 

(ii) The invariant distribution for the given system may be directly obtained by study of its 
Fokker-Planck equation. Following [26], it is straightforward to show that (|19[) has the 
following invariant distribution: 

/0/3(g,P,0 = ^exp(-/3iL(q,p))exp (^-^(^-7)2^ , (20) 


where Z is the normalizing constant and 

„ ^ (4 + ai) 

^ =- 2 - 


( 21 ) 


where cjf = ay/Kt. Observe that this means that if cja = 0, then, as limAt-i.oO'F = 
0, we find that ^ tends to a variable which is normally distributed with mean zero. 
Alternatively, if cja / 0, one would obtain 

^ ^ , t ^ oo , At ^ 0 , 


where is the variance and the symbol —)• indicates that ^ converges in probability 

law to a normally distributed random variable with the indicated parameters. The order 
of the limits here is important: t —)■ oo first (to reach the invariant distribution), then 
At 0. 

(hi) The ergodicity of (fT^ with respect to the distribution indicated above can easily be 
demonstrated by reference to Hormander’s condition for hypoellipticity following the 
method in [42j . as for Langevin dynamics. The only additional step is to verify that 
the noise propagates into the ^ variable, which follows due to its strong coupling to the 
momenta. 

(iv) This dynamics is a bit unusual in that it must be viewed as stepsize dependent, al¬ 
though we mention that such mixed systems are used in the study of backward error 
analysis [38]. One simply thinks of the characteristics of stochastic paths associated 
with (fT^ as being stepsize dependent. Although (fT^ takes on the appearance of a 
standard Ito SDE system, we must bear in mind that in discretizing these equations the 
conservative force F{q) and the associated noise term cr\/^Af^/^dW must be evalu¬ 
ated together at every stage, since the formulation (fT9]) is a notational device to make 
clear the properties of the system. 
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3 Numerical Methods for Adaptive Thermostats 

Since stochastic systems in most of the cases cannot be solved “exactly”, splitting methods are 
often adopted in practice. For instance here, the vector field of the Ad-Langevin/SGNHT (1171) 
can be split into four pieces which are denoted as “A”, “B”, “O”, and “D”, in such a way 
that each piece can be solved “exactly”, 


’ q ’ 

p 

_ 

o 1 

_ 1 

di “h 

0 

-V[/(q) + aMV2R 

dt 

0 

-Cpdt + (TAM^/^dWA 

+ 

0 

0 

_ e . 


0 


0 


0 


. G{p) _ 



where G{p) = — Vd/csF] . 

Clearly parts “A” and “D” can be solved “exactly”. As mentioned previously, the under¬ 
lying dynamics for “B” is 

dp = -VU{q)dt + apM^/^dW , (22) 

where q is fixed and (jp = a^/At. Integrating (1221) from 0 to At gives the exact solution in 
distribution of this part as 

p{At) = p(0) — AtVU{q) + -\/At(TFAf^/^R 

= p(0) + At[-VUiq) + = p(0) + AtF{q ), 

where R is a vector of i.i.d. standard normal random variables. It should be noted that 
applying the Euler-Maruyama method to (j22j) gives the same result; thus, for constant force, 
Euler-Maruyama is “exact”. 

The “O” or “Ornstein-Uhlenbeck” part is usually stated with ^ a positive constant, in 
which case the solution is found to be |29] 

^ /-I _ p-2iAt 

p{At) = e“^^‘p(0) o-aW- ^ -M^/2r , (23) 

where p(0) is the initial value of the variable and R is a vector of i.i.d. standard normal 
random variables. However, the same formula (j23p is easily seen to be valid for < 0, 
since the quantity (1 — e“^^^*)/(2^) is strictly greater than zero unless = 0. (The proof 
is obtained by following the standard procedure [29] ■) When ^ = 0, one can simply replace 
(1 — e“^^^^)/(2^) by its well-defined asymptotic limit, 

p{At) = p(0) -|- (24) 

The generators associated with each piece are defined, respectively, as 

£a = • Vq , 

2 

£b = -VU{q) • Vp + {MVD , 

2 

Co = -iP • Vp + ^Tr {MVl) , 
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where ap = a^/At in part “B” is stepsize dependent. 

Overall, the generator of the Ad-Langevin/SGNHT (1171) system can be written as 


C, — Cp^ + Cp, + Cq + Cp . 


(25) 


The flow map (or phase space propagator) of the system can be written in the shorthand 
notation 

, 

where the exponential map here denotes the solution operator. Approximations of Ft can be 
obtained as products (taken in different arrangements) of exponentials of the splitting terms. 
For example, the phase space propagation of the method proposed by Ding et al. m for the 
Ad-Langevin/SGNHT (fT7)) system (denoted as “SGNHT-N”) can be written as 

exp (^At£sGNHT-N) = exp (At£p) exp {MCa) exp {MCp) , (26) 

where 

£p = £b + (27) 

and exp(At£j) represents the phase space propagator associated with the corresponding 
vector field /. Because of its nonsymmetric structure, one anticipates first order convergence 
to the invariant measure (for any choice of cr). Due to the naming of the component parts, 
the SGNHT-N method may be denoted by “PAD”. 

Overall, the SGNHT-N/PAD integration method is as follows: 

Pn+l = Pn + At (^-VU{qn) “h - At^nPu + VAt(TAM^/^R„ , 

Qn+l = qn + AtM" Vn+1 , 

?n+l = “ NdkpT) , 


where R(j and R„ are vectors of i.i.d. standard normal random variables. 

We propose symmetric alternative methods, such as the following symmetric Ad-Langevin/ 
SGNHT (SGNHT-S) splitting method: 

g^t.^SGNHT-S — ^ 28 ) 

where exact solvers for parts “B” and “O” derived above are applied. The SGNHT-S method 
may be referred to as “BADODAB”, where it should be noted that the various operations 
are symmetrically applied and the steplengths are uniform and span the interval At. Other 
symmetric splittings are considered below. 
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The SGNHT-S numerical integration method may be written as 

Pn+l/3 =Pn + (At/2) (^-VU{qn) + , 

Qn+ll2 = Qn + {At/2)M ^Pn+l/3 , 

Cn+1/2 =Cn+ (At/2)/i"^ (Pn+l/3M"Vn+l/3 “ , 

if (en+1/2 / 0) : Pn+2/3 = Wl/3 + /i2^r^+l/2) , 

else : Pn+ 2/3 = Pn+ 1/3 + , 

Cn+l = ^n+l/2 + (At/2)/X ^P^_|_ 2 ^ 3 M Pn+2/3 “ A^/cbT^ , 

q„+l = qn+l /2 + (At/2)iVfVn+2/3 > 

Pn+1 = Pn+2/3 + (At/2) {-VU{qn+l) + . 

The force computed at the end of each timestep can be reused at the start of the next step; 
thus only one force calculation is needed in SGNHT-S at each timestep, the same as for 
SGNHT-N. In practice, one could replace the exponential and square root operations in the 
exact solver of the “O” part by their respective well-defined asymptotic expansions to reduce 
the computational cost. 

3.1 Order of Convergence of Ad-Langevin/SGNHT 

The analysis of the accuracy of ergodic averages (averages with respect to the invariant mea¬ 
sure) in stochastic numerical methods can be performed using the framework of long-time 
Talay-Tubaro expansion, as developed in [Tl[2lll3ll331[36l[60] . In what follows we compare the 
order of convergence of the two Ad-Langevin/SGNHT methods with a clean gradient. 

For a splitting method described by £ = Ca+Ci 3 + - ■ ■+Cq, we define the effective operator 
associated with the perturbed system obtained using the numerical method with stepsize 
At by the relation 

exp = exp ^At£f ^ exp ^At£^^ ... exp ^At£^^ . 

This operator can be computed using the Baker-Gampbell-Hausdorff (BGH) expansion and 
can thus be viewed as a perturbation of the exact Fokker-Planck operator 

£f = + AtC\ + At^c\ + 0{At^) (29) 

for some perturbation operators C\. 

We also define the invariant distribution p associated with the numerical method as an 
approximation of the target invariant distribution pp: 

p = pp[l + Atfi + At‘^f2 + At^fs + 0(At^)] (30) 

for some correction functions fi satisfying (/j) = 0. 

Substituting O and p into the stationary Fokker-Planck equation 

= 0 
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yields 


£.2 + 0(Ai^)^ 1^1 + At/i + At^/2 + At^/3 + 0(At^)j) — 0. 

Since the exact Fokker-Planck operator preserves the invariant canonical distribution, i.e., 
= 0, we obtain 

= -^\p0 (31) 

by equating first order terms in At. 

For any particular integration scheme it is possible to hnd the perturbation operator C\ 
by using the BCH expansion. Then we can calculate its action on pj^. The last step, namely 
obtaining the leading correction function /i, requires the solution of the above PDF (see 
examples in Langevin dynamics |34] 1. In general, solving for /i in closed form is difficult, and 
it does not get simpler as we consider, as here, more complicated formulations than Langevin 
dynamics and more complicated splittings. 

According to the BCH expansion, for (noncommutative) linear operators X and H, we 
have 

exp(AtX) exp(Aty) = exp(AtZi), 

where 

At 

z, = x + Y + ^[X,y] + —([A, [A,y]] - [y, [A,y]]) + o{xt ^), (32) 

and subsequently 


exp exp(Aty)exp 


exp(AtZ2), 


where 

Z2 = A + y + ^ ([y, [y, a]] - i[A, [a, y]]) + o(At^). (33) 

The notation [A, y] = Ay — yA denotes the commutator of operators A and Y. 

These equations demonstrate that for nonsymmetric splitting methods, there typically 
exists a nonzero term C\ oc [A, y] / 0, while the condition C\ = 0, implying /i = 0, 
is automatically satished for symmetric splitting methods; thus, for observables 
assuming the asymptotic expansion holds, the computed average would be of order two 

= {4>) + At((/)/i) + At‘^{(j)f2) + • • • = {4>) + O(At^), 


where (•) denotes the average with respect to the target invariant distribution. Therefore, the 
SGNHT-S method (j28p would have second order convergence for all the observables. 

We can work out the leading operator C\ associated with the nonsymmetric SGNHT-N/PAD 
method (l26]) of Ding et al. [H], 


C 


t 

1,PAD 


(\r) r) 


+ 


r) rt 


+ 


r) rl 1 'l 


(34) 


It is clear that the leading term /i,pad in the perturbed distribution (f30]l is in general nonzero. 
Therefore the nonsymmetric SGNHT-N/PAD method would be expected to exhibit first order 
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convergence to the invariant measure. It should be noted that if certain conditions are sat¬ 
isfied, higher order convergence to the invariant measure would be possible as demonstrated 
by Abdulle et al. m- However, it can be easily demonstrated that it is not the case here for 
the SGNHT-N/PAD method. In the presence of a noisy gradient, the Ad-Langevin/SGNHT 
methods, despite the stepsize dependency (fT9|) . would similarly (and generally) be expected 
to be first order with respect to the invariant distribution. 


3.2 Superconvergence Property 

Recently, it has been demonstrated in the setting of Langevin dynamics that a particular 
symmetric splitting method (“BAOAB”), which requires only one force calculation per step, 
is fourth order for configurational quantities in the ergodic limit and in the limit of large 
friction [3lll36]. 

In what follows we demonstrate that the newly proposed SGNHT-S/BADODAB method (12811 
effectively inherits the super convergence property of BAOAB in the setting of Ad-Langevin/ 
SGNHT system (1191) with a clean gradient, in case where the parameters cta and /r are both 
taken to infinity in a suitable way. For simplicity, we consider here a one-dimensional model 
H = p^/2 -|- U{q), but the analysis could easily be extended to higher dimensions. 

Following the standard procedure described in Section 13.11 we obtain the following PDF 
associated with the BADODAB method: 


^’^(P/3/2) = -ApI 3 , 

where is the exact Fokker-Planck operator 

£t = _pQ^ + U\q)dp + ^dp{p-) + ^dpp - -{p^ - 

P P 

with invariant measure 

P/3(9,P,0 = ^exp(-/3i/(g,p))exp “7)^) > 

where 7 = (0 = ^2 calculated by using the BCH expansion 


4 = — 
2 12 


c) 


c) c) 




ct I ct 


ct I ct r\ 




rt _i_ rt I ct 


(35) 


(36) 


(37) 


rt I ct I r\ r\ 


1 


c) 


c) c) 




r) 


ct ct I ct 


-b 


c) 


r\ ct I ct I r\ 


whose action on the extended invariant measure reads as 




-PpViq) + 4 / 3 p 7 ® + mPU"{q) + WpU'{q)U"{q) + {l - l3p^) 


PP 


12 


3U''{q) + 4^2 - 16 / 34 ,c 2 _ ^ 6 , ^^4 _ ^^2 ^ 

P 


PP 


+ 7^ (2/3/ - l) p/3 + 7^ 0 - pp . 
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The equation is very complicated, and we have no direct means of solving it. However, 
the additional variable ^ has mean 7 . If we suppose that /i is large, then the variance of ^ will 
be small. In this case we can consider the approximation obtained by replacing functions of 
^ in the PDE (|35ll by their corresponding averages 


«>=7, <a = T + f. = (38) 

|3^i Pn 

We use this as part of an averaging of the stationary Fokker-Planck equation with respect 
to the auxiliary variable. That is, we project the Fokker-Planck equation and its solution by 
integrating with respect to the Gaussian distribution of ^ in the ergodic limit. We can think 
of this is as defining a sort of “subspace projection”; it is related to the Galerkin method that 
is widely used in solving high-dimensional linear systems and PDFs, including Fokker-Planck 
equations [iniEQ]. In this case, we apply the projection operator m 




P0(q,p, 0>^(q,p,0d^ 
fQ^pp(g,p,Od^ 


(39) 


where is an arbitrary function, to the PDF (I35p . Effectively, this results in the reduced 
equation 

^Hppf2) = -ppr^, (40) 

pp 

where the operator O is just the operator reduced by the action of the projection, and 
which acts on functions of q and p; this is nothing other than the corresponding adjoint 
generator of Langevin dynamics. Likewise, /2 is now a function of q and p only. The right- 
hand side simplifies to 


PpV^ =(^ [^pU'{q)U"{q)-p^U'''{q)] + 


PP 


1_ 

12 


W'iq) - 3^p‘^U''(q) + - (6 Pp‘^ - 28p‘^ + 10/3"^) 

P 


where is the Gibbs (canonical) density {ex.p{—/3H{q,p))). 

We consider the high friction limit (7 —>■ 00 ) and expand /2 in a series involving the 
reciprocal friction e = I/ 7 , 

h{q,p) = f2fl{q,p) + sf2,i{q,p) + £^f2,2{q,p) H-, (41) 


with each function / 2 ,i satisfying (/ 2 ,i) = 0. Dividing (I35h by the friction coefficient 7 , we 
obtain ^ 

-|- ecQj (^hfi + £f2,i + pp = —eppV > ( 4 ^^) 

where 

= dp{p-) + I3~^dpp , tIj = -pdq + U'{q)dp . (43) 


We take the high thermal mass limit (/i —>■ 00 ) in such a way that e = 1/// = I/ 7 . The use 
of this limit yields the following terms of the expansion of the right-hand side in powers of e. 
Defining 


-eppV 


PP 


= 9 = (50 + £gi) Pp , 
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we have 


(44) 

(45) 


90 = -\[U'’{q)-(5p^U’'{q)] , 

91 = [WpU'{q)U'\q) - (3p^U'"{q) + 6/3^^ - 28^^ + wp-^] . 

Furthermore, by equating powers of the reciprocal friction e, we can solve a sequence of 
equations 

^oiPlshfi) = gopp , 

^hippho) + ^oipphi) = 9ipp , 

^ll{Ppf2,l)+^o{Ppf2,2) =0, 


to obtain the leading term / 2 ,o, i-e., 

flo = ^ {U''iq) - /3p^U''{q)) . (46) 

Moreover, it can be easily shown that the marginal average of with respect to 

momentum is zero, i.e., 

^ f^,^^^^^^{q,p)ppdiOp = 0, (47) 

which leads to the average of conhgurational observables 4 >{q) with respect to the invariant 
measure as 


{<P{q))BADODAB = (Hi)) + Af2((()(g)/2^^°°°^®) + ©(eAt^ + Af'^). 

Thus, for configurational observables the BADODAB method has fourth order convergence 
to the invariant measure in the large friction and thermal mass limits (i.e., e ^ 0), 

lim((/)(g))BADODAB = {(Piq)) + O(At^). 

£->■0 

It should be emphasized here that only the BADODAB and BAODOAB methods appear 
to have the superconvergence property among a number of different splitting methods investi¬ 
gated in the Ad-Langevin/SGNHT system (1191) with a clean gradient. The superconvergence 
property suggests the use of relatively large cja and p oc a\ in the BADODAB (SGNHT-S) 
method in order to enhance sampling accuracy. In fact, we expect that larger values of p than 
this bound will not diminish the sampling accuracy, but the effect of large values of p is to 
reduce the responsiveness of the thermostat device. 

4 Numerical Experiments 

In this section, we conduct a variety of numerical experiments to compare the performance 
of the different schemes presented in this article. 
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Configurational Temperature 


Average Potential Energy 




Figure 1: Log-log plot of the relative error in computed configurational temperature (left) and 
average potential energy (right) against stepsize by using two Ad-Langevin/SGNHT methods (with a 
clean gradient). The system (cta = 3) was simulated for 5000 reduced time units, but only the last 
80% of the data were collected to calculate the quantity to make sure the system was well equilibrated. 
Ten different runs were averaged to further reduce the sampling errors. The stepsizes tested began 
at At = 0.03 and were increased incrementally by 10% until both methods showed significant relative 
error (SGNHT-N became unstable at around At = 0.08). 


4.1 Molecular Systems 

Before we compare various methods in machine learning applications (i.e., with a noisy gradi¬ 
ent), we first demonstrate the order of convergence of various splitting methods with a clean 
gradient. 

A popular model of an A^-body system with pair interactions based on a spring with rest 
length (i.e., pendulum) was used, a standard if simplified model of molecular dynamics. The 
total potential energy of the system is defined as 

N-l N 

U{q)=Y^ (48) 

2=1 j=i+l 


where rij = ||qi —QjH denotes the distance between two particles i andj, andy?(rij) represents 
the pair potential energy 




k , ,2 

77 (Gj - g) 

0 , 


nj < G ; 
rij > Vc , 


(49) 


where k and Tc represent the spring constant and the cutoff radius, respectively. 

A system consisting of = 500 identical particles (i.e., unit mass) was simulated in a 
cubic box with periodic boundary conditions [1]. The positions of the particles were initialized 
on a cubic grid with equidistant grid spacing, while the initial momenta were i.i.d. random 
variables with mean zero and variance /cpT, which was set to be unity. The thermal mass p 
was chosen to be 10 unless otherwise stated. Particle density pd = 4 was used with spring 
constant k = 25 and cutoff radius Tc = 1. 
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Figure 2: Log-log plot of the relative error in computed average potential energy against stepsize 
by using the SGNHT-S/BADODAB method with (left) different values of cta = 10) and (right) 
different values of /r (ca = 9). The format of the plots is the same as in Figure [T] except 50 different 
runs were used to reduce the sampling errors in high accuracy regime. 


We first compare the two SGNHT methods on controlling two configurational quanti¬ 
ties: configurational temperature and average potential energy. The configurational tem¬ 
perature [22], which, as the kinetic temperature, should in principle be equal to the target 
temperature, can be defined as 


ksT 




where the angle brackets denote the averages, and Vj[/ and V?17 represent the gradient and 
Laplacian of the potential energy U with respect to the position of particle i, respectively (see 
more discussions in |39j ). 

As shown in Figured] with the help of the dashed order lines, we can see that SGNHT-N 
and SGNHT-S show hrst and second order convergence, respectively, as expected. It is clear 
that SGNHT-S has not only at least one order of magnitude improvement in accuracy in both 
observables, but also much greater robustness over the SGNHT-N method, which becomes 
completely unstable at around At = 0.08. The results on the configurational temperature and 
average potential energy are rather similar; therefore in what follows we present only average 
potential energy results. 

We also investigate the effect of changing the value of a a in the SGNHT-S/BADODAB 
scheme proposed in this article. As can be seen from Figure [21 the SGNHT-S method displays 
second order convergence to the invariant measure when cta is relatively small, while a fourth 
order convergence is observed in the high friction limit (cta = 9), as anticipated from the 
analysis of the previous section. It should be emphasized here that the super convergence 
property was observed only in the BADODAB and BAODOAB methods, which both reduce 
to the BAOAB method [341136] in Langevin dynamics. 

Figure m also compares the effect of varying the value of the thermal mass // when cta is 
fixed. It can be seen that the BADODAB method displays a clear fourth order convergence 
when /i is relatively large, while when /i is small, not only is the smooth discretization error 
dependence on stepsize lost, but significantly larger relative error is also observed. This 
reinforces the choice of a relatively large value of /r. It is worth pointing out that fi = W 
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Figure 3: Log-log plot of the relative error in computed average potential energy against stepsize 
by using various splitting methods of the Ad-Langevin/SGNHT system (cta = 3). The format of the 
plot is the same as in Figure [TJ 


works as well as /x = 100; therefore /x = 10 is used throughout this article since a relatively 
smaller /x corresponds to a tighter interaction between the thermostat and the system, and 
thus it can fluctuate more rapidly to accommodate changes in the noise and adapt more easily. 

We also explore in Figure[3]the performance of various splitting methods of the Ad-Langevin/ 
SGNHT system (jlOp with fixed values of cja and /x. All the methods clearly show second order 
convergence, with ABDODBA and BADODAB methods achieving one order of magnitude 
improvement in accuracy compared to the other methods. This again illustrates the impor¬ 
tance of optimal design of numerical methods. The ABDODBA method seems to be slightly 
better that the BADODAB method in the regime of cja = 3; however, as demonstrated in 
Figure O the BADODAB method achieves a dramatic improvement in accuracy when cja is 
relatively large (e.g., cja = 9), while other schemes remain the same except for the BAODOAB 
method. 

4.2 Bayesian Inference 

In this subsection we compare methods in a classical Bayesian inference model in one di¬ 
mension, i.e., to estimate the mean of a normal distribution with known variance m- More 
precisely, given N i.i.d. samples from a normal distribntion, Xi ~ where it should 

be noted that /x is the trne mean, when we draw samples with known and a uniform prior 
distribntion ranging from —N/2 to N/2, we are able to calculate the posterior distribution of 
the mean in a closed form ^ 

/x~AA^x, , (50) 
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where x = the context of stochastic gradient approximation, we have 


7r(/i|X) oc 7r(X|/i)7r(/i) « Jj7r(x^J/i) 7r(/i) 






N 


N 


JJexp 

-i=l 


{Xi - jlf 
2d2 


1 

iv 


— , exp 

V27ro- / \ n 


N {xi — fi)"^ \ 1 
^ ^ ] iV 


oc 


( N {xi — fi) 


2 = 1 
2\ 


= exp 


n 

2 = 1 
1 N 
h 


2 a^ 

N 


y~](xi — x)^ + h(x - fiY 


\i=l 


ocexp( -^{x-fif ) , 


(51) 


where x = 'Y^=iXiln. It clearly recovers the true distribution (I50p when h = N. Taking 
the logarithm and differentiating the posterior distribution obtained at the end of (15 ip with 
respect to fi gives the noisy force 

= . (52) 

In this simple case, the noise of the stochastic gradient is independent of fi and is a constant 
given h. Moreover, we are able to obtain its mean and variance with respect to the stochastic 
gradient [M1I62] : 

N ( 1 ^ \ 

VarF(/i) = ^) varX, 

n 

where VarX is the variance of the dataset. Thus, it is straightforward to verify that the noise 
is normally distributed according to the central limit theorem. 

In our numerical experiments, a a was chosen as 1 due to the fact that large a a results 
in stability issues here. We generated N = 100 samples from AA(0,1) and randomly selected 
a subset of size h = 10 at each timestep to compute the noisy force (1521) . We plot the 
distributions of the posterior mean of the dataset obtained by using four different methods 
with different stepsizes in Fignre[H Clearly, two SGNHT methods completely outperformed 
the SOLD and mSGLD methods. The latter only demonstrate good approximation of the true 
distribution with order of magnitude smaller stepsize compared to the former. But it should 
be noted that mSGLD here is slightly better than SGLD in maintaining the true distribution: 
the distribution of mSGLD with At = 0.001 is visibly much closer to the target compared to 
that of SGLD with the same stepsize. 
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Figure 4: Comparisons of the distribution in a one-dimensional Bayesian inference problem by using 
SOLD (top left), mSGLD (top right), SGNHT-N (bottom left), and SGNHT-S (bottom right) with 
different stepsizes indicated by different colors. The solid black line is the exact solution. Note the 
difference in the legends between rows. 


Note that stepsizes for SGNHT (second order dynamics) and SGLD (first order dynam¬ 
ics) based methods are not directly comparable—as mentioned in [M] the stepsize of a first 
order dynamics method like Euler-Maruyama when viewed as the limiting discretization of a 
Langevin integrator corresponds to At^/2, where At is the stepsize of the Langevin method. 
However, in our experiments we are uninterested in the time-dynamics of the system and care 
only about the invariant measure. Therefore the important relationship is the error in ther¬ 
modynamic averages in comparison with the number of timesteps (work), which quantifies 
the efficiency of a given method. The stepsize is just an arbitrary parameter which allows for 
refinement of the statistical calculation. 

Between the two SGNHT methods, SGNHT-S (the new scheme being proposed here) is 
obviously superior to SGNHT-N: the latter starts to show significant deviation from the true 
distribution at At = 0.02, while the distribution of the former still looks well matched to the 
true one at At = 0.03. Our observations are confirmed by Figure O where the mean absolute 
error (MAE) of the distribution of the two SGNHT methods is plotted. The MAE, which can 
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Figure 5: Log-log plot of the MAE in the distribution of the Bayesian inference model against step- 
size. The box indicates that the system was unstable with corresponding stepsizes for the SGNHT-N 
method. 


be thought of as a relative error in distribution, is defined as 

1 ^ 

MAE = ^ ^ |wi - d)il, (54) 

i=l 

where N denotes the number of intervals, which was chosen as 100. Wj and tUj represent the 
observed frequency in bin i and the exact expected frequency, respectively [M]. As can be 
seen, the stability threshold of SGNHT-N was around At = 0.03, beyond which the system 
became unstable, as highlighted in the figure (in which case the system blew up, resulting in 
a 100% MAE). Once again, SGNHT-S not only shows an order of magnitude better accuracy 
but also has a much greater robustness than SGNHT-N. In particular, for defined accuracy, 
the SGNHT-S method is able to use double the stepsize compared to SGNHT-N, which means 
a remarkable 50% improvement in overall numerical efficiency as defined in [39]. 

4.3 Bayesian Logistic Regression 

Following |62|, we also investigate the performance of different methods for a more complicated 
Bayesian logistic regression model. The data yi G {—1,1} were modelled by 

7r(yi|xi,/3) = f{yiP^Xi) , 

where f{z) = 1/(1 -|- exp(—z)) G [0,1] is the logistic function and x* G are rows of a 
fixed dataset. Our goal is to estimate the posterior mean of parameter vector f3 G For 
simplicity, a multivariate Gaussian prior AA(0, 1) was used on (3. Therefore, by using Bayes’ 
theorem, we obtain the following posterior distribution: 

/I \ ^ 

7r(/3) oc exp {--\\f3f j /(?/i/3'^Xi). (55) 

Following the same procedure in the Bayesian inference example f Section l4.2|) . we can calculate 
the noisy force and then plug it into different thermostats for sampling. 
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Figure 6: Comparisons of the RMSE of the posterior mean in the Bayesian logistic regression model 
by using various methods against stepsize. The system was simulated for 1000 reduced time units with 
100,000 different runs. The stepsizes tested began at At = 0.001 and were increased incrementally by 
30% until all methods either displayed significant error or became unstable (mSGLD and SGNHT-N). 


In our numerical experiments, we considered the d = 3 case with N = 1000 data points. 
We chose the dataset to be 


X = 


a^i,i 

3:1,2 

3:2,1 

3:2,2 

3:1000,1 

3:1000,2 


1 \ 
1 

1 / 


(56) 


where Xjj were sampled from a standard normal distribution M{0 , 1) for i = 1,... , 1000 and 
j = 1,2. A subset of size h = 100 was randomly chosen at each timestep to compute the 
noisy force. 

The performance of estimating the posterior mean value of parameter vector f3 by various 
methods {a\ = 6) was tested and plotted in Figured! Again, SGLD and mSGLD, displaying 
considerably larger root mean square error (RMSE) with a fixed stepsize, were outperformed 
by the two SGNHT methods. In this case, the SGLD and mSGLD methods demonstrated 
similar control in numerical accuracy, but the latter displayed much worse stability than that of 
the former and became unstable just above At = 0.01. As reported in the original paper m, 
the performance of the mSGLD method depends strongly on the size of the subset—for a 
larger subset, which requires higher computational cost, the bias of mSGLD can be smaller 
than that of SGLD. 

Of the two SGNHT methods, the SGNHT-S method again shows not only at least an 
order of magnitude improvement on accuracy but also much better robustness than the other: 
SGNHT-N became unstable just above At = 0.02. Remarkably, the SGNHT-S method at 
At = 0.1 still achieves better accuracy than the SGLD method at At = 0.01. In other 
words, the method we propose here gives more than a 90% improvement in overall numerical 
efficiency compared to one of the most popular methods in the literature. For fixed accuracy, 
the SGNHT-S method can use almost four times the stepsize of the SGNHT-N method (i.e., 
an improvement of about 75% in overall numerical efficiency). 
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5 Conclusions 


We have reviewed a variety of methods in stochastic gradient systems with applications in 
machine learning. We have provided a theoretical discussion on the foundation (underly¬ 
ing dynamics) of those stochastic gradient systems, which has been lacking in the litera¬ 
ture. We have also proposed a new symmetric splitting (at least second order) method in 
SGNHT (SGNHT-S/BADODAB), which substantially improves the accuracy and robustness 
compared to a nonsymmetric splitting (first order) method (SGNHT-N) proposed recently 
in the literature. Furthermore, we have demonstrated that under certain conditions the 
SGNHT-S/BADODAB method can inherit the superconvergence property recently discov¬ 
ered in integrators for Langevin dynamics, i.e., fourth order convergence to the invariant 
measure for configurational averages. 

By conducting various numerical experiments, we have demonstrated that the two SGNHT 
methods outperform the popular SGLD method and its variant mSGLD. In particular, the 
SGNHT-S method can use up to ten times the stepsize of SGLD, which implies a remarkable 
more than 90% improvement in overall numerical efficiency. Between the two SGNHT meth¬ 
ods, the SGNHT-S method can use almost four times the stepsize of SGNHT-N for defined 
accuracy (i.e., about a 75% improvement in overall numerical efficiency). 

It should be noted that in certain cases, it may be desirable to employ a Metropolis- 
Hastings procedure in order to remove the discretization bias |54] . However, we emphasize 
that the correction is not without computational cost, particularly as the dimension is in¬ 
creased [6l[28l[52l[53], and the results of [MHSS] and of the current article demonstrate that 
high accuracy with respect to the invariant distribution is often achievable using traditional 
numerical integration techniques, thus in many cases entirely eliminating the necessity of 
Metropolis-Hastings corrections (see more discussions in [36]). Moreover, we mention that 
the methods of this article can in principle be combined with Metropolis-Hastings algorithms 
if it is necessary to completely eliminate the discretization bias. 
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